Psychometric performance of the Primary Mitochondrial Myopathy Symptom Assessment (PMMSA) in a randomized, double-blind, placebo-controlled crossover study in subjects with mitochondrial disease

Background The Primary Mitochondrial Myopathy Symptom Assessment (PMMSA) is a 10-item patient-reported outcome (PRO) measure designed to assess the severity of mitochondrial disease symptoms. Analyses of data from a clinical trial with PMM patients were conducted to evaluate the psychometric properties of the PMMSA and to provide score interpretation guidelines for the measure. Methods The PMMSA was completed as a daily diary for approximately 14 weeks by individuals in a Phase 2 randomized, placebo-controlled crossover trial evaluating the safety, tolerability, and efficacy of subcutaneous injections of elamipretide in patents with mitochondrial disease. In addition to the PMMSA, performance-based assessments, clinician ratings, and other PRO measures were also completed. Descriptive statistics, psychometric analyses, and score interpretation guidelines were evaluated for the PMMSA. Results Participants (N = 30) had a mean age of 45.3 years, with the majority of the sample being female (n = 25, 83.3%) and non-Hispanic white (n = 29, 96.6%). The 10 PMMSA items assessing a diverse symptomology were not found to form a single underlying construct. However, four items assessing tiredness and muscle weakness were grouped into a “general fatigue” domain score. The PMMSA Fatigue 4 summary score (4FS) demonstrated stable test–retest scores, internal consistency, correlations with the scores produced by reference measures, and the ability to differentiate between different global health levels. Changes on the PMMSA 4FS were also related to change scores produced by the reference measures. PMMSA severity scores were higher for the symptom rated as “most bothersome” by each subject relative to the remaining nine PMMSA items (most bothersome symptom mean = 2.88 vs. 2.18 for other items). Distribution- and anchor-based evaluations suggested that reduction in weekly scores between 0.79 and 2.14 (scale range: 4–16) may represent a meaningful change on the PMMSA 4FS and reduction in weekly scores between 0.03 and 0.61 may represent a responder for each of the remaining six non-fatigue items, scored independently. Conclusions Upon evaluation of its psychometric properties, the PMMSA, specifically the 4FS domain, demonstrated strong reliability and construct-related validity. The PMMSA can be used to evaluate treatment benefit in clinical trials with individuals with PMM. Trial registration ClinicalTrials.gov identifier, NCT02805790; registered June 20, 2016; https://clinicaltrials.gov/ct2/show/NCT02805790. Supplementary Information The online version contains supplementary material available at 10.1186/s41687-022-00534-y.


Background
Primary mitochondrial diseases (PMD) are a group of rare, clinically heterogeneous disorders resulting from over 350 different genetic mutations of the nuclear DNA (nDNA) and/or mitochondrial DNA (mtDNA) [1][2][3]. Primary mitochondrial myopathy (PMM) refers to PMD with predominant, but not exclusive, involvement of muscles, leading to defects in oxidative phosphorylation across various muscle groups, including skeletal and cardiovascular muscles [4][5][6]. Among the PMD population, with an estimated incidence of 1 in 4300 to 10,000 [7][8][9], it is expected that approximately 90-95% of patients may experience PMM, although the exact prevalence of PMM is unknown [10,11]. PMM is characterized by a variable signs and symptoms experience, including fatigue, muscle weakness, pain, and exercise intolerance [12,13]. As a result of this vast array of symptoms, patients report having difficulties with independent and safe ambulation, understanding conversation in noisy settings, driving, personal hygiene, and reading. Social, emotional, and economic concerns also plague adult patients with PMM. In addition, the daily management of symptoms for these patients can be overwhelming. Given this substantial negative impact on aspects of quality of life [14], it is important to consider effects on symptoms when testing novel treatments for this population [15].
To date, there have been no successful PMD clinical trials, partly due to a lack of disease-specific patient outcome measures [16]. Several of the outcomes that characterize PMD are best measured via self-report; however, there is limited use of patient-reported outcome (PRO) symptom measures in PMM studies and existing measures may not be well suited to do so [17,18]. For example, the Newcastle Mitochondrial Disease Adult Scale (NMDAS) [19] is an assessment of physical functioning and disease severity based on both clinical assessment and patient/caregiver interviews. Although clinician and caregiver perspectives are important, patient selfreports may provide a more direct and accurate measure of symptom severity and function limitations. In addition, The Newcastle Mitochondrial Quality of Life measure (NMQ) [20], a PRO questionnaire, addresses health-related QoL, rather than focusing on the details of the signs and symptoms associated with mitochondrial disease, which may be more important in understanding the direct effects of the disease pathophysiology on the patient. Moreover, while both the NMDAS and NMQ were developed and tested in a mitochondrial disease population, neither was developed specifically for use with individuals with the PMM subtype [19,20]. Given the heterogeneity of mitochondrial disease, subtype-specific assessments may be warranted for adequate measurement of treatment benefit [17]. Further, the NMDAS is intended for use in six-to twelve-month intervals and the NMQ has a four-week recall period. These relatively long intervals are not well suited to capture the effects of new treatments on symptoms, which may appear in a shorter amount of time. Regulatory guidelines recommend the use of shorter recall periods in PRO measures to be utilized in clinical trials [21,22]. For example, the Food and Drug Administration Guidance on PROs states that "short recall periods or items that ask patients to describe their current or recent state are usually preferable" [23 (p.14)].
Currently, there is a lack of robust, clinically meaningful, validated clinical trial outcome measures to provide for the optimal efficacy evaluation of novel treatments for patients with PMDs, such as PMM [16]. To fill the gap in available PRO measures that can be used to assess the signs and symptoms of PMM, the Primary Mitochondrial Myopathy Symptom Assessment (PMMSA) was developed. The PMMSA is a ten-item daily assessment evaluating symptom severity over the past 24 h and was developed through extensive patient interviews (42 interviews were conducted) to ensure its content validity. Specifically, the symptoms assessed by the PMMSA were demonstrated to be relevant to individuals with PMM and these individuals understood the instructions, items, and response scales of the measure and were able to provide meaningful responses to the items upon administration [24].
The goal of the current study was to examine the quantitative measurement characteristics of the PMMSA. This included an assessment of the dimensionality of the measure and the reliability, construct validity, and sensitivity to change of the PMMSA scores. This testing was accomplished using data from a Phase 2 randomized, double-blind, placebo-controlled crossover study in subjects with mitochondrial disease in order to inform its inclusion and performance in a subsequent Phase 3 trial. Because mitochondrial disease is a rare disease, the trial included fewer patients than would normally be used for testing of measurement characteristics. Our analyses utilized standard psychometric approaches when possible, with the use of additional innovative methods to account for the relatively small sample size.

Patient selection criteria
Subjects included in this study provided informed consent prior to participation. To be selected for inclusion, subjects must have met all of the inclusion and none of the exclusion criteria. Broadly, North American male and female consenting adults with a diagnosis of PMM as selected by the Investigators to participate in an earlier Phase 1/2 study of elamipretide were eligible. Those with medical conditions that could put them at risk, who had adverse reactions to the study drug in the Phase 1/2 study, or who were actively enrolled in another trial were excluded [25,26].

Assessments
Subjects were asked to complete assessments during study center visits at Screening (Visit 1), Baseline/Day 1 (Visit 2), and at the end of Week 4 (Visit 3), Week 8 (Visit 4), Week 12 (Visit 5), and Week 14 (Visit 6). Additionally, patients completed the PMMSA outside of the clinic via an electronic daily diary for 14 weeks, starting from the Screening Visit and continuing to the End-of-Study (14 weeks) or Early Discontinuation Visit.

PMMSA
The PMMSA is a 10-item PRO questionnaire that assesses tiredness at rest, tiredness during activities, muscle weakness at rest, muscle weakness during activities, balance problems, vision problems, abdominal discomfort, muscle pain, numbness, and headache over the  previous 24 h on a four-point verbal rating scale (VRS) ranging from 1-Not at all to 4-Severe (Additional file 1: Table S1). As a once-daily diary, subjects completed electronic versions of the PMMSA between 6:00 pm and 11:59 pm beginning with the screening visit and through the end of the study (14 weeks) or until early discontinuation. A principal use of the results presented here was to inform appropriate scoring for the PMMSA. Item scores: Each item's daily score is reflected by a 1 to 4 VRS where higher scores reflect more severe patientreported symptom involvement.
Domain scores: Two fatigue domains, which resonate with the descriptive language most often used by patients, were hypothesized, including a four-item fatigue scale (4FS) made up of Items 1, (tiredness at rest), 2 (tiredness during activities), 3 (muscle weakness at rest), and 4 (muscle weakness during activities) and a two-item fatigue scale (2FS) made up of Items 2 (tiredness during activities) and 4. (muscle weakness during activities). Both the 4FS and 2FS are derived by summing the item scores for each day. Responses to at least three of the four items in the 4FS and both items in the 2FS were required to calculate a daily score. If the 4FS daily item responses met the noted criteria, prorated summed scores were found by averaging the available item responses and then multiplying that value by the total number of items; the prorated summed score approximates the summed score for the individual if all items had been answered and is equivalent to individual item mean substitution for the missing item responses.
Total symptom scale (TSS): The TSS score is calculated as the sum of the 10 items. A prorated summed score was calculated if the patient responded to at least 7 of the 10 items.
Daily and weekly scores: Both daily and weekly scores were derived for PMMSA items, domains (4FS and 2FS), and the TSS. Weekly scores were derived as the average daily value from the preceding seven days of a target analysis day. For example, if the Baseline visit (Day 1) is the target analysis day, then the Baseline weekly score is the average of scores generated from study Days 0, − 1, − 2, − 3, − 4, − 5, and − 6.

Reference measures
Subjects in the study completed the following assessments which served to support the psychometric evaluation of the PMMSA: • Quality of Life in Neurological Disorders (Neuro-QoL) Fatigue short form [27]; an eight-item instrument assessing fatigue (five-point VRS ranging from 1-Never to 5-Always); scores were obtained using "look up tables" (i.e., summed score to expected a posteriori score conversion tables) provided by Neuro-QoL; • Physician Global Assessment (PhGA); a single-item assessment in which the clinician rates the study subject's overall health status (five-point VRS ranging from 1-Excellent to 5-Poor); • Patient Global Assessment (PGA), a single-item assessment in which the study subject rates his or her own overall health status (five-point VRS ranging from 1-Excellent to 5-Poor); • Six-minute walk test (6MWT) [28]; the distance, in meters, that a subject covers during a six-minute period and two self-report items assessing shortness of breath and fatigue before and after the six-minute walk (12-point modified Borg scale); • Triple Timed Up and Go (3TUG) Test [29]; the time, in seconds, for the subject to complete three repetitions of standing from a seated position, walking 10 feet, and returning to a seated position; • Scale for the Assessment and Rating of Ataxia (SARA) [30]; an assessment of ataxia in which a clinician rates the subject on eight domains, with higher scores indicating higher ataxia severity; • The purpose of including an ataxia severity measure in this study relates to the tendency for this patient population to experience effects on the nervous system which can manifest in balance issues; and • "Most bothersome item, " a single item assessment prompting subjects to rate the one symptom deemed to be the most bothersome among the 10 symptom concepts assessed in the PMMSA. This item was administered at the Screening Visit only, prior to the first completion of the PMMSA.
All reference measures, except for the "most bothersome" item, were administered during each of the study center visits. The Neuro-QoL Fatigue item bank was additionally administered during the nurse home visit at Week 6. Patients were not compensated specifically for the completion of any of the PRO measures.

Analyses
All analyses were conducted in SAS 9.4 and focused on evaluating the performance of PMMSA item scores and the a priori PMMSA domain scores. The details are discussed in relevant sections.

Descriptive statistics
To examine individual item response distributions and PMMSA summary score properties (4FS, 2FS, and TSS) at the daily level, a subset of collected days (Days were examined. Weekly scores for the PMMSA items, 4FS, 2FS, and TSS (using PMMSA ratings from the seven days preceding a target timepoint [i.e., clinic visits/nurse home visits]) were also calculated.

Dimensionality analysis
As a newly developed, potentially multi-dimensional scale, PMMSA data were summarized to see if stable patterns among items emerged in order to increase understanding of the scale structure and inform future scoring of the tool. Due to the relatively small sample size, a repeated measures approach was used that capitalized on multiple data points from the same patient. Specifically, daily responses for Day -6, Day 1, and every subsequent 10 th day were combined into a dimensionality analyses dataset. These days were selected to capture pre-intervention days and then, post-intervention, were selected to reduce the potential for auto-correlations between days close together temporally. This data was used to generate polychoric correlations among the PMMSA items and were also submitted to categorical exploratory factor analysis (CEFA), which adapts typical continuous variable factor analytic methods to appropriately accommodate categorical data (such as the PMMSA 4-category response options) [31]. While the sampling of multiple days will produce biased standard errors (due to the non-independence of observations), it provides unbiased point estimates (e.g., factor loadings), allowing for a general review of the likely underlying measurement structure of the PMMSA items.

Reliability analysis
The reliability of PMMSA scores was assessed in two ways. First, internal consistency estimates of the PMMSA TSS, 4FS, and 2FS were assessed via Cronbach's coefficient alpha (α) generated at each of the daily analysis time points (Day − 6, Day 1, and every subsequent 10th day up to Day 101]). Second, test-retest reliability estimates for PMMSA item, TSS, 4FS, and 2FS scores were calculated as the Pearson correlation coefficients relating weekly scores generated during (1) screening Week 1 (Days − 13 to − 7) and screening Week 2 (Days − 6 to 0) and (2) Week 7 and Week 8. These times were selected because patients were expected to have stable symptoms during these intervals. Internal consistency and test-retest reliability were examined. Reliability estimates at or greater 0.70 was considered acceptable [35].

Construct-related validity analysis
First, convergent and discriminant validity was assessed by cross-sectional Pearson's and Spearman's correlation coefficients generated between weekly scores produced by the PMMSA and the reference measures administered at Visits 2 through 6. Second, known-groups analyses were planned by comparing the PMMSA weekly scores within pre-specified groupings determined by the PhGA, PGA, and 6MWT scores via independent samples t-tests within each time point. Specifically, known groups were defined by (1) PGA scores (two groupings comprising those who self-report their health status as "Excellent" or "Very Good" and "Poor" or "Fair"), (2) PhGA scores (two groupings comprising subjects with clinician-rated health status as "Excellent" or "Very Good" and "Poor" or "Fair"), and (3) 6MWT results (two groupings comprising subjects who covered 400 m or more and those who covered less than 400 m) [28]. It was expected that the Excellent/Very Good groups (for the PGA and PhGA analysis) and the ≥ 400 m group (for the 6MWT analysis) would have lower PMMSA weekly scores. Third, sensitivity-tochange estimates were generated to reflect the relationships between weekly PMMSA score changes and change scores observed in the relevant reference measures via Pearson's and Spearman's correlation. The change scores for all relevant measures were generated from the end of the experimental treatment period to the end of the placebo period two.

Score interpretation analysis
Score interpretation analysis informs the clinical meaning that may be attached to observed within-person change. Distribution-based methods and anchor-based methods were used to inform the interpretation of scores and arrive at treatment responder definitions [32]. Distribution-based methods included the 0.5 standard deviation (SD) and standard error of measurement (SEM) [33]. The anchor-based methods used the change in PGA, PhGA, and 6MWT scores as anchors to categorize patients into improved and non-improved groups; the mean changes in PMMSA scores in the improved groups on the anchor measures were reported as candidate responder definitions [34]. The distribution-based methods are presented to provide some information on the degree of variability in the measures at baseline. The responder definitions for the PMMSA scores would be expected to exceed the values of the 0.5 SD and SEM.

Results
Sample demographics: Table 1 presents patient demographics and genotypic characteristics. Thirty-one individuals participated in the SPIMM 202 clinical trial, 30 of whom completed all study activities. The 30 participants had a mean age of 45.30 years, with 83.0% (n = 25) being female and 97.0% being non-Hispanic white.
PMMSA item, domain (4FS and 2FS), and TSS scores: Descriptive statistics for the 10 PMMSA item scores were calculated for Days -6 to 0, Day 1, and each subsequent 10th day through Day 101. The distributions of daily response showed that the patients used the full range of the 1-4 response scale. Items assessing vision problems, abdominal pain, numbness, and headache were generally rated as lower in severity (response means ranged from 1.55 [SD = 0.81] to 2.12 [SD = 1.04] across all analyzed days), while items assessing tiredness at rest, tiredness during activities, muscle weakness at rest, muscle weakness during activities, balance problems, and muscle pain were generally rated as "Mild" or "Moderate" in severity (response means ranging from 2.36 [SD = 0.98] to 2.86 [SD = 0.81] considering all analyzed days). Similarly, and consistent with expectations that not all PMD patients will experience all PMD symptoms, all items were endorsed as "not at all" by at least one subject on each Table 1 Demographic and genotypic characteristic summary (N = 30) [25] One subject failed to meet inclusion criteria at Visit 2 and was discontinued from the study. One subject discontinued from the study after the Week 10 home visit (between Visit 4 and Visit 5). Both participants' demographic information collected at screening is included in the above summaries 6MWT = six-minute walk test; SD = standard deviation  Table 2. In general, weekly score averages for items assessing tiredness, muscle weakness, balance problems, vision problems, and muscle pain were between moderate and severe, whereas weekly averages for items assessing abdominal discomfort, numbness, and headache were between mild and moderate. Overall, the 4FS and 2FS severity scores were slightly higher than the TSS, when considering the number of items contributing to each score; however, all weekly summary scores were generally between "mild" to "moderate" on the response scale.
Inter-item correlations: As can be seen in Table 3, there is a strong item cluster among the first 4 PMMSA items (tiredness at rest, tiredness during activities, muscle weakness at rest, muscle weakness during activities) in which those items are more interrelated to each other than to the other items in the assessment; this is not unexpected given the similarity/repeated nature of item text/content and supports the a priori 4FS score. PMMSA item 5 (balance problems) is also related to this cluster, but at a lower level.
Categorical exploratory factor analysis: Table 4 presents results of the exploratory factor analyses of the PMMSA items for both the full 10-item set and the a priori defined 4FS item set (factor loadings > 0.40 have been bolded for ease of review) [35]. For the 10-item set, the two-factor solution suggests item clusters of two factors comprising Items 1-5 and 7-10, respectively. However, the domain created by items 7-10 was not considered conceptually coherent and interpretable and was not examined further. For the a priori 4FS item set, all items loaded strongly onto a single factor but there were also strong loadings for the 2 tiredness items on the second factor in the 2-factor solution, indicating that there may be residual dependence among these 2 items (and likely the muscle weakness items) due to content similarity. However, this finding does not preclude these items creating meaningful and useful summed scores within the classical test theory framework.
Based on the results of the inter-item correlations and the factor analysis, the PMSSA TSS and was dropped from further consideration. Additionally, the 2FS can be considered a less accurate version of the 4FS and preference was given to the 4FS which contained more items and therefore more information.

Reliability analyses
Internal consistency: As described in them Methods, coefficient alpha for the 4FS scores were computed across the 12 individual days selected for the day-level analyses. The     Test-retest reliability: Table 5 presents a stable testretest reliability of the weekly 4FS scores for the two periods described in the Methods section. The reliability values were above the threshold considered sufficient to support their use in making individual-level decisions (0.90) and group-level comparisons (0.80). The sample size for Period 1 was very small because not all patients completed a sufficient number of daily assessments during the screening period Week 1 to generate weekly scores. Table 6 presents the average correlations between the weekly 4FS, the individual PMMSA items, and scores produced by the administered reference measures; correlations were assessed cross-sectionally by visit but in the interest of space, we report the mean correlation coefficient across the examined visits. We note that the correlations intended to establish convergent validity were consistent with respect to direction and generally consistent with respect to magnitude across visits (e.g., between 4FS and fatigue item from the 6WMT task observed r's = 0.29, 0.47, 0.35, 0.32, 0.23 across 5 examined visits). The mean correlations, excepting those involving the Neuro-QoL fatigue scale, are based on the average of coefficients generated at Study Table 4 Item loadings and factor correlations from exploratory factor analyses of the PMMSA items, for both the 10-item TSS and 4FS

Construct-related validity analyses
Bold text indicates factor loadings larger than 0.40 in absolute value CEFA = categorical exploratory factor analysis; PMMSA = Primary Mitochondrial Myopathy Symptom Assessment The CEFA is based on scores combined across all assessment days (i.e., the analysis ignores time of assessment and averages scores generated from daily responses for Day − 6 [from screening], Day 1 [Baseline], and every subsequent 10th day [Days 11, 21, and up to Day 101]). Nesting was intentionally ignored. Additionally, in instances in which a response category had less than five observed responses, the responses were recoded as the next lowest response category except for the response option 0-Not at all, which was recoded as the next highest response category. For the CEFA of the 10-item PMMSA, solutions with up to three factors were examined, while a maximum of two factors were used for the Fatigue 4 scale subset. An item was considered to belong to a factor if the loading value was greater than 0. 40   With respect to convergent validity, several a priori hypotheses were confirmed [36,37].
• A positive and strong correlation (r = 0.69) was found between the 4FS and the Neuro-QoL indicators of fatigue. • Consistently positive and mostly moderate correlations (r = 0.30 to 0.49) were found between 4FS and the 6MWT self-reported fatigue score both preand post-6MWT. • Correlation between 4FS and self-reported overall health status (as assessed by the PGA, r = 0.39) was stronger than the physician ratings of the subject's health status (as assessed by the PhGA, r = 0.28). • A positive and small correlation (r = 0.25) was observed between the 4FS and scores from SARA.
Although not a primary consideration for this analysis, a strong positive correlation was found between the SARA and the PMMSA balance item, as anticipated.
• Small correlations were observed between the 4FS and 3TUG time to completion and 6MWT distance traveled score. These correlations were also in the expected directions (i.e., positive for the 3TUG completion time and negative for the 6MWT distance).
Overall, the PMMSA 4FS scores tended to correlate more strongly with fatigue-specific reference variables and were found to have less strong correlations with more distal reference variables (e.g., PhGA, 3TUG time to completion).
The most bothersome symptom item from among the PMMSA items was also evaluated. A total of seven unique symptoms were reported as most bothersome, including muscle weakness during activities (n = 7 or 23.3%), balance problems (n = 6 or 20%), vision problems (n = 5 or 16.7%), tiredness during activities (n = 5 or 16.7%), tiredness at rest (n = 3 or 10%), abdominal discomfort (n = 2 or 6.7%), and muscle pain (n = 2 or 6.7%). No subjects reported muscle weakness at rest, numbness, or headache as most bothersome. The reported severity of the most bothersome symptom at Baseline on the PMMSA (mean = 3.10, SD = 0.80) was greater than the severity of all other individual symptoms combined (mean = 2.35, SD = 0.60). The latter result was replicated when considering all PMMSA response days, in which the mean response for the most bothersome symptom was 2.88 (SD = 0.82) compared to 2.18 (SD = 0.56) for the remaining nine items.
Known-groups analyses: Table 7 presents the average scores on each PMMSA variable across the observation weeks. Due to the small sample sizes in each group standardized differences were reviewed but no inferential tests were performed. As expected, more positive global ratings were associated with numerically lower weekly PMMSA scores and the magnitudes of the relationships were generally strong, particularly for the PGA. The relationship with the 6MWT grouping was not as strong and in the unexpected direction for two PMMSA items (abdominal discomfort and headache).
Sensitivity-to-change analysis: Table 8 presents the correlations among change scores between weekly PMMSA 4FS scores and reference variables, with change defined as the difference in scores between the active treatment period (Visit 3 or 5, depending on order) and the placebo treatment period (Visit 3 or 5, depending on order). Indicators of score sensitivity include consistently (1) positive and strong correlation (r = 0.71) between change in the 4FS and change in Neuro-QoL fatigue scores, (2) negative and moderate correlation (r = − 0.46) between change in the 4FS and change in 6MWT distance, and (3) positive and small correlation (r = 0.21) between change in the 4FS and change in the 3TUG time-to-completion  Table 6 Average correlations between reference variables and PMMSA scores across time points   results. The small correlations between change in the 4FS and change in PhGA and PGA may be due to a lack of variability in the PhGA ratings over time. The correlations of change in PMMSA item scores were generally in the expected direction; however, correlations between the PMMSA items were typically not as strong as those observed with the 4FS.

Discussion
The PMMSA is a PRO daily diary measure that was created to evaluate treatment benefit in regulated PMM clinical trials and developed in accordance with best measurement practices and regulatory guidelines [21][22][23]. With its content validity established [24], results from the present analyses were generated to evaluate the measure's underlying factor structure, scoring algorithm and address its psychometric performance. The CEFA results suggest that the full 10-item PMMSA is multidimensional and a composite or TSS may not be appropriate for this instrument. However, scale dimensionality analyses did suggest the first four items of the PMMSA form a general fatigue item parcel. As the remaining six items were not included as an a priori domain, did not load strongly on a single factor, it is more appropriate to treat them as individual item scores. Therefore, the PMMSA is best represented by 6 scores: fatigue (fouritem composite), balance problems, vision problems, abdominal discomfort, muscle pain, numbness, and headache.
Results support the conclusion that the PMMSA yields scores that are reliable, valid, and sensitive to change over time. Test-retest reliability remained stable producing similar result between the two different assessment points. The pattern of correlations with other variables indicated convergent and discriminant validity: For example, while the PMMSA weekly fatigue score was robustly related to the NeuroQoL Fatigue measure, it was largely unrelated to unrelated concepts, such as height, weight, and ataxia. Specific correlations between select individual items and criterion measures-i.e. the balance problems item and the SARA-also supported the validity of the measure. However, the headache item was not correlated with any of the criterion measures. Known-groups analyses indicated that the PMMSA weekly fatigue scores were robustly related with patient and physician global evaluations of the patient's health. Individual items were also generally related to the global evaluations as expected, particularly with the patient's global assessment. The change in the weekly PMMSA fatigue scores was also strongly associated with change over time on the NeuroQoL Fatigue measure and 6MWT distance. Small to moderate correlations were observed between the other PMMSA weekly scores and the criterion variables. Overall, strong evidence for the measurement characteristics of the PMMSA scores was obtained from the trial, despite the small sample size.
The responder definition analysis indicated that a change of approximately 2 points on the PMMSA weekly fatigue score could be considered meaningful for patients. The average baseline score for the 4FS was 11.3. Therefore, a 2-point improvement represents a 15-20% reduction in the fatigue score. Similarly, the meaningful change thresholds for the individual item scores were generally between 0.25 and 0.50, which reflect 15-20% reductions from baseline. The estimates for the 4FS and tiredness and muscle weakness items exceeded the distributionbased values; this is expected as the distribution-based approaches are group-level analyses and likely underestimate the true responder threshold. However, this was not the case for the other individual items, where the 0.5 SD values were larger than the responder definition estimate. Therefore, these specific item-level estimates should be confirmed in future studies with larger sample sizes and tailored anchor measures, if possible. The responder definition estimates are based on anchor-based analyses that follow regulatory recommendations [23]. However, different methods that are designed to identify meaningful within-patient change (e.g., [38]), could yield different results. Change from baseline on the 6MWT was the only anchor variable that was associated with PMMSA scores above 0.30, a common threshold for identifying suitable anchors [39]; this may be due to limited variability in the patient and clinician global measures. These relatively low correlations may reduce the precision and reliability of the meaningful change estimates.
The heterogeneity of the PMM symptom experience and the challenges in detecting treatment effects using a multi-symptom assessment in this context are formidable. In this context, supplementing the PMMSA with an item that asked subjects to select which PMM symptom they deemed most bothersome at screening was an important aspect of the overall measurement strategy. A total of seven unique symptoms were reported as most bothersome, with muscle weakness during activities being reported most often. Importantly, symptom severity was rated as higher for the endorsed most bothersome symptom relative to other symptoms. This additional item serves to tailor the PMMSA to the individual's experience and could be explored further to determine whether an individual-specific single item could be used in conjunction with a more general mitochondrial disease assessment for future research.
The results presented herein ought to be interpreted with caution and in the context of several limitations. First, many of the analyses were conducted using samples too small to test many of the underlying methodological assumptions and, therefore, not all types of analyses could be implemented and the magnitude of the relationships between variables were reviewed for consistency with a priori hypotheses but rarely tested for statistical significance. These challenges are common when developing PRO measures for rare disease populations [40]. Nevertheless, the analyses produced plausible patterns of results, even suggesting discriminant patterns of relationships between the PMMSA and the criterion variables. In addition, the sample was limited to United States-based, English-speaking participants, lacking cultural diversity in representation of mitochondrial disease populations. To address these limitations, it is recommended that future research assesses the PMMSA among individuals globally [41]. Future studies could also consider the value of other methods for summarizing the daily data, such as examining the most severe score in a week, rather than the average score. Additionally, the PMMSA and other PRO measures were completed in the context of a carefully monitored clinical trial. It may be challenging to administer all of the same measures in a less controlled study.
The study and the PMMSA also had several notable strengths. The tests of reliability, validity, and responsiveness followed expert and regulatory best practices using the intensive within-patient observations to account for the relatively small sample size. The use of the daily diary approach with short-recall period (24 h) is a notable difference from other PRO measures that have been used with mitochondrial disease patients and generic measures of symptoms. Perhaps because of this approach, the PMMSA fatigue score was more sensitive to treatment effects than other PRO measures, as described in a separate publication [25].

Conclusion
The PMMSA is a content-valid PRO measure whose subdomain and individual item scores have been found to be reliable, construct-valid, and interpretable in patients with genetically confirmed mitochondrial disease, specifically, those with mitochondrial myopathy. Also, the PMMSA fatigue scores are suited for use as an independently scored subscale among this population. Other items can be used individually to comprehensively evaluate the patient's symptom burden. The findings suggest that the PMMSA is a valuable tool to examine the patient's perspective on those symptoms that negatively affect their QoL and activities of daily living in clinical trials.